Abstract

This paper describes Artex, another algorithm for Automatic Text Summarization.
In order to rank sentences, a simple inner product is calculated between each sentence, a
document vector (text topic) and a lexical vector (vocabulary used by a sentence).
Summaries are then generated by assembling the highest ranked sentences. No rule-based
linguistic post-processing is necessary in order to obtain summaries. Tests over several
datasets (coming from Document Understanding Conferences (DUC), Text Analysis
Conferences (TAC), evaluation campaigns, etc.) in French, English and Spanish have shown
that the Artex summarizer achieves interesting results.

Keywords: Automatic Text Summarization, Vector Space Model, Text extraction,
Ultra-stemming

1 Introduction 

Automatic Text Summarization (ATS) is the process of automatically generating a compressed
version of a source document [15]. Query-oriented summaries focus on a user's request, and
extract the information related to the specified topic given explicitly in the form of a query
[5]. Generic mono-document summarization tries to cover as much of the information content
as possible. Multi-document summarization is a task oriented toward creating a summary from
a heterogeneous set of documents on a focused topic. Over the past years, extensive
experiments on query-oriented multi-document summarization have been carried out. Extractive
summarization produces summaries by choosing a subset of representative sentences from the
original documents. Sentences are ordered and then assembled according to their relevance to
generate the final summary [10].

This article introduces a new summarization method based on sentence extraction in the
Vector Space Model (VSM). We score each sentence by calculating its inner product with a
pseudo-sentence vector and a pseudo-word vector. Results show that Artex not only preserves
the content of the summaries generated using this new representation but, surprisingly,
can often improve performance. Artex could be an interesting and simple algorithm within
the extractive summarization paradigm. Our tests on trilingual corpora (English, Spanish
and French), evaluated by the Fresa algorithm (without human references), confirm the good
performance of Artex.

In this paper, related work is given in Section 2. Section 3 presents the new algorithm for
Automatic Text Summarization. Experiments are presented in Section 4, followed by Results
in Section 5 and Conclusions in Section 6.



2 Related works 



Research in Automatic Text Summarization was introduced by H.P. Luhn in 1958 [S]. In the
strategy proposed by Luhn, the sentences are scored for their component word values as
determined by tf*idf-like weights. Scored sentences are then ranked and selected from the top
until some summary length threshold is reached. Finally, the summary is generated by
assembling the selected sentences in the original source order. Although fairly simple, this
extractive methodology is still used in current approaches. Later on, [3] extended this work
by adding simple heuristic features, such as the position of sentences in the text or key
phrases that indicate the importance of the sentences. As the range of possible features for
source characterization widened, choosing appropriate features, feature weights and feature
combinations has become a central issue.

A natural way to tackle this problem is to consider sentence extraction as a classification
task. To this end, several machine learning approaches that use document-summary pairs
have been proposed OHl]. A hybrid method mixing statistical and linguistic algorithms is
presented in [1]. [10] and [15] provide a good overview of the state of the art of Automatic
Text Summarization tasks and algorithms.

2.1 Document Pre-processing 

The first step in representing documents in a suitable space is pre-processing. As we use
extractive summarization, documents have to be chunked into cohesive textual segments that
will be assembled to produce the summary. Pre-processing is very important because the
selection of segments is based on words or bigrams of words. The choice was made to split
documents into full sentences, in this way obtaining textual segments that are likely to be
grammatically correct. Afterwards, sentences pass through several basic normalization steps
in order to reduce computational complexity.

The process is composed of the following steps:

1. Sentence splitting. A simple rule-based method is used for sentence splitting. Documents
are chunked at periods, exclamation marks and question marks.

2. Sentence filtering. Words are lowercased and cleaned of sloppy punctuation. Words
with fewer than 2 occurrences (f < 2) are eliminated (a hapax legomenon occurs only once in a
document). Words that do not carry meaning, such as functional or very common words,
are removed. Small stop-lists (depending on the language) are used in this step.

3. Word normalization. Remaining words are replaced by their canonical form using 
lemmatization, stemming, ultra-stemming or none of them (raw text). Four methods of 
normalization were applied after filtering: 

• Lemmatization by simple dictionaries of morphological families. These dictionaries
have 1.32M, 208K and 316K word entries in Spanish, English and French, respectively.

• Porter's Stemming, available at Snowball (web site
http://snowball.tartarus.org/texts/stemmersoverview.html) for English, Spanish and French,
among other languages.

• Ultra-stemming. This normalization seems to be very efficient and it produces a
compact matrix representation [IB]. Ultra-stemming considers only the first n letters of
each word. For example, in the case of ultra-stemming with the first letter (Fix1), inflected
verbs like "sing", "song", "sings", "singing"... or proper names "smith", "snowboard",
"sex",... are replaced by the letter "s".

4. Text Vectorization. Documents are vectorized in a matrix $S_{[P \times N]}$ of P rows
(sentences) and N columns. Each element $s_{i,j}$ represents the occurrences of an object j
(a letter in the case of ultra-stemming, a word in the case of lemmatization or a stem for
stemming), $j = 1, 2, \ldots, N$, in the sentence i, $i = 1, 2, \ldots, P$.
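The four pre-processing steps can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the stop-list `STOP_WORDS`, the function names and the regular expressions are assumptions made for illustration, and only the ultra-stemming normalization is shown.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-list for illustration; the actual lists are not given.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "to"}

def split_sentences(text):
    """Step 1: chunk a document at periods, exclamation and question marks."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def ultra_stem(word, n=1):
    """Step 3: keep only the first n letters of a word (n=1 is Fix1)."""
    return word[:n]

def preprocess(text, n=1, min_freq=2):
    """Return (sentences, vocabulary, S), where S is a P x N count matrix."""
    sentences = split_sentences(text)
    # Step 2: lowercase, strip punctuation and remove stop words.
    tokenized = [
        [w for w in re.findall(r"[^\W\d_]+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    # Step 2 (continued): drop words with fewer than min_freq occurrences.
    freq = Counter(w for sent in tokenized for w in sent)
    tokenized = [[w for w in sent if freq[w] >= min_freq] for sent in tokenized]
    # Step 3: normalize the remaining words by ultra-stemming.
    stemmed = [[ultra_stem(w, n) for w in sent] for sent in tokenized]
    # Step 4: S[i][j] = occurrences of term j in sentence i.
    vocabulary = sorted({t for sent in stemmed for t in sent})
    index = {t: j for j, t in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in sentences]
    for i, sent in enumerate(stemmed):
        for t in sent:
            matrix[i][index[t]] += 1
    return sentences, vocabulary, matrix
```

With Fix1 the vocabulary can hold at most one column per initial letter, which is why the resulting matrix is so compact.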



3 AnotheR TEXt summarizer (Artex) 

Artex¹ is a simple extractive algorithm for Automatic Text Summarization. The main idea is
as follows. First, we represent the text in a suitable space model (VSM). Then, we construct
an average document vector that represents the average (the "global topic") of all sentence
vectors. At the same time, we obtain the "lexical weight" for each sentence, i.e. the number of
words in the sentence. After that, the angle between the average document vector and
each sentence is calculated; narrow angles indicate that the sentences are close to the "global
topic" and should therefore be considered important and extracted. Figure 1 shows the VSM of
words: the P sentence vectors and the average "global topic" are represented in an
N-dimensional space of words.

Figure 1: The "global topic" in a Vector Space Model of N words.



Next, a score for each sentence is calculated from its proximity to the "global topic"
and its "lexical weight". In Figure 2, the "lexical weight" is represented in a VSM of P
sentences.

Finally, the summary is generated by concatenating the sentences with the highest scores,
following their order in the original document.
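This final assembling step can be sketched as follows, assuming the sentence scores are already available. The function name `assemble_summary` and the parameter `k` (number of sentences to keep) are illustrative assumptions; a real system would more likely use a length or compression threshold.

```python
def assemble_summary(sentences, scores, k):
    """Keep the k highest-scoring sentences and restore their source order."""
    # Rank sentence indices from the highest to the lowest score.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Re-sort the selected indices so the summary follows the original order.
    chosen = sorted(ranked[:k])
    return " ".join(sentences[i] for i in chosen)
```

For example, with scores `[0.1, 0.9, 0.3, 0.8]` and `k = 2`, the second and fourth sentences are kept, in that order.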



¹ In French, Artex est un Autre Résumeur TEXtuel.



Figure 2: The "lexical weight" in a Vector Space Model of P sentences.



3.1 Algorithm 

Formally, the Artex algorithm computes the score of each sentence by calculating the inner
product between a sentence vector, an average pseudo-sentence vector (the "global topic") and
an average pseudo-word vector (the "lexical weight").

Once pre-processing (word normalization and filtering of stop words) is completed, a matrix
$S_{[P \times N]}$ that contains N words (or letters) and P sentences is created using the
Vector Space Model.

Let $s_i = (s_{i,1}, s_{i,2}, \ldots, s_{i,N})$ be the vector of the sentence i,
$i = 1, 2, \ldots, P$. We define $a$, the average pseudo-word vector, as the average number of
occurrences of the N words used in the sentence i:

(1)  $a_i = \frac{1}{N} \sum_{j=1}^{N} s_{i,j}$



and $b$, the average pseudo-sentence vector, as the average number of occurrences of each
word j used throughout the P sentences:

(2)  $b_j = \frac{1}{P} \sum_{i=1}^{P} s_{i,j}$

The score or weight of each sentence $s_i$ is calculated as follows:

(3)  $\mathrm{score}(s_i) = (s \times b) \times a = \left( \sum_{j=1}^{N} s_{i,j} \times b_j \right) \times a_i ; \quad i = 1, 2, \ldots, P; \; j = 1, 2, \ldots, N$

The score(•) computed by equation (3) must be normalized to the interval [0,1]. The
calculation of $s \times b$ indicates the proximity between the sentence $s$ and the average
pseudo-sentence $b$. The product $(s \times b) \times a$ weighs this proximity using the
average pseudo-word $a_i$.



If a sentence $s_i$ is near $b$ and its corresponding element $a_i$ has a high value, $s_i$
will therefore have a high score. Conversely, a sentence i far from the main topic (i.e.
$s_i \times b$ is near 0) or a less informative sentence i (i.e. $a_i$ is near 0) will have a
low score.
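The scoring scheme described in this section can be written as a short Python sketch over a count matrix given as a list of rows. The function name `artex_scores` is hypothetical, and dividing by the maximum score is only one assumed way of mapping the scores into [0,1]; the text does not say which normalization is used.

```python
def artex_scores(S):
    """Score each sentence (row of S) as (s_i . b) * a_i, rescaled to [0, 1]."""
    P, N = len(S), len(S[0])
    # a_i: average number of occurrences of the N terms in sentence i.
    a = [sum(row) / N for row in S]
    # b_j: average number of occurrences of term j over the P sentences.
    b = [sum(S[i][j] for i in range(P)) / P for j in range(N)]
    # Inner product with the "global topic", weighted by the "lexical weight".
    raw = [sum(S[i][j] * b[j] for j in range(N)) * a[i] for i in range(P)]
    top = max(raw)
    # Assumption: divide by the maximum so that scores fall in [0, 1].
    return [r / top if top > 0 else 0.0 for r in raw]
```

A sentence that shares few terms with the rest of the document, or that contains few retained terms, ends up with a score near 0.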

In computational terms, it is not really necessary to divide the scalar product by the
constant $P$, because the angle $\alpha = \arccos \frac{b \cdot s}{|b| |s|}$ between $b$ and
$s$ is the same if we use $b_j = b'_j = \sum_i s_{i,j}$. The element $\frac{1}{P}$ is only a
scale factor that does not modify $\alpha$.

In fact, if the matrix $S_{[P \times N]}$ is approximated by a binary matrix²
$S'_{[P \times N]}$, where each element $s'_{i,j} \in \{0, 1\}$ has a probability of
$p = \frac{1}{2}$, we can normalize vectors $a$, $b$ and matrix $S$ as follows:

[Equation (4): definition of the norm $|s|$]

Vectors will then be represented in hyper-spheres of N or P dimensions, and the normalized
score' in this space would be:

(7)  $\mathrm{score}'(s_i) = \left( \frac{s}{|s|} \times \frac{b}{|b|} \right) \times \frac{a}{|a|} = \frac{1}{|s| |b| |a|} \left( \sum_{j=1}^{N} s_{i,j} \times b_j \right) \times a_i ; \quad i = 1, 2, \ldots, P; \; j = 1, 2, \ldots, N$

However, the term $\frac{1}{|s| |b| |a|}$ is a constant value (i.e. a simple scale factor),
and so the score(•) calculated using equation (3) and the score'(•) calculated using
equation (7) are equivalent.
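The equivalence argument can be checked numerically with a small sketch. It assumes, as the text does, that the normalization term is the same positive constant for every sentence; under that assumption only the ranking of the sentences matters. The function names and the `scale_b` parameter are hypothetical.

```python
def raw_scores(S, scale_b=1.0):
    """Unnormalized scores, with the pseudo-sentence vector b scaled by scale_b."""
    P, N = len(S), len(S[0])
    a = [sum(row) / N for row in S]
    # scale_b = 1/P gives the averaged b; scale_b = 1 gives the plain column sums.
    b = [scale_b * sum(S[i][j] for i in range(P)) for j in range(N)]
    return [sum(S[i][j] * b[j] for j in range(N)) * a[i] for i in range(P)]

def ranking(scores):
    """Sentence indices ordered from the highest to the lowest score."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

S = [[2, 1, 0], [0, 1, 0], [1, 1, 1]]
with_average = ranking(raw_scores(S, scale_b=1 / len(S)))
with_sums = ranking(raw_scores(S, scale_b=1.0))
```

Both calls order the sentences identically, so the same sentences would be extracted whichever scale factor is applied.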



4 Experiments 

The Artex algorithm described in the previous section has been implemented and evaluated on
corpora in several languages.

We have conducted our experiments with the following languages, summarization tasks,
summarizers and data sets: 1) generic multi-document summarization in English with the
corpus DUC'04; 2) generic single-document summarization in Spanish with the corpus Medicina
Clínica; and 3) generic single-document summarization in French with the corpus Pistes.

We have applied the summarization algorithms and evaluated the results using Fresa. The
processing times for each summarizer have also been measured and compared.

The following subsections present formally the details of the summarizers, corpora and 
evaluations studied in different experiments. 



² This is a reasonable approximation in this context, because $S_{[P \times N]}$ is a sparse
matrix with many term occurrences equal to one or zero.



4.1 Other Summarizers 



To compare performance, two other summarization systems were used in our experiments:
Cortex and Enertex. To ensure the same conditions, these two systems used exactly the
same textual representation based on the Vector Space Model, described in Section 2.1.

• Cortex is a single-document summarization system using several metrics and an optimal 
decision algorithm [ij [Til [15] [T5] . 

• Enertex is a summarization system based on the Textual Energy concept [S]: text is
represented as a spin system where spins ↑ represent words whose occurrences are f > 1
(spins ↓ if the word is not present).

4.2 Summarization Corpora Description 

To study the impact of our summarizer, we used corpora in three languages: English, Spanish
and French. The corpora are heterogeneous, and the tasks are representative of Automatic
Text Summarization: generic multi-document summarization and mono-document summarization
guided by a subject.

• Corpus in English. Piloted by NIST in the Document Understanding Conference³ (DUC),
Task 2 of DUC'04⁴ aims to produce a short summary of a cluster of related documents.
We studied generic multi-document summarization in English using data from DUC'04.
This corpus, with 300K words (17 780 types), is composed of 50 clusters of 10 documents
each.

• Corpus in Spanish. Generic single-document summarization using a corpus from the
scientific journal Medicina Clínica⁵, which is composed of 50 medical articles in Spanish,
each one with its corresponding author abstract. This corpus contains 125K words (9 657
types).

• Corpus in French. We have studied generic single-document summarization using the
Canadian French Sociological Articles corpus, generated from the journal Perspectives
interdisciplinaires sur le travail et la santé (Pistes)⁶. It contains 50 sociological articles
in French, each one with its corresponding author abstract. This corpus contains nearly
400K words (18 887 types).

4.3 Summaries Content Evaluation 

DUC conferences have introduced the ROUGE content evaluation [7], which measures the
overlap of n-grams between a candidate summary and reference summaries written by humans.
However, writing the human summaries required by ROUGE is a very expensive task.

Recently, metrics without references have been defined and tested at the DUC and Text
Analysis Conferences (TAC)⁷ workshops.

Fresa content evaluation [17] is similar to ROUGE evaluation, but human reference
summaries are not necessary. Fresa calculates the divergence of probabilities between the
candidate summary and the source document. Among these metrics, Kullback-Leibler (KL)
and Jensen-Shannon (JS) divergences have been widely used by [HI HZ] to evaluate the
informativeness of summaries.

³ http://duc.nist.gov
⁴ http://www-nlpir.nist.gov/projects/duc/guidelines/2004.html
⁵ http://www.elsevier.es/revistas/ctl_servlet?_f=7032&revistaid=2
⁶ http://www.pistes.uqam.ca/
⁷ www.nist.gov/tac

In this article, we use Fresa, based on KL divergence with Dirichlet smoothing, as in
the 2010 and 2011 INEX editions [IT], to evaluate the informative content of summaries by
comparing their n-gram distributions with those from source documents.

Fresa only considers the absolute log-diff between the term occurrences of the source and
the summary. Let T be the set of terms in the source. For every $t \in T$, we denote by
$C_t^T$ its occurrences in the source and by $C_t^S$ its occurrences in the summary.

The Fresa package computes the divergence between the source document and the summaries
as follows:

(8)  $D(T \| S) = \sum_{t \in T} \left| \log \left( \frac{C_t^T}{|T|} + 1 \right) - \log \left( \frac{C_t^S}{|S|} + 1 \right) \right|$

To evaluate the information content (the "quality") of the generated summaries, after
removing stop-words, several automatic measures were computed: Fresa_1 (unigrams of single
stems), Fresa_2 (bigrams of pairs of consecutive stems), Fresa_SU4 (bigrams with 2-gaps also
made of pairs of consecutive stems) and finally ⟨Fresa⟩, i.e. the average of all Fresa values.

The Fresa values (scores) are normalized between 0 and 1. High Fresa values mean less
divergence between the summary and the source document, reflecting a greater amount of
information content. All summaries produced by the systems were evaluated automatically
using the Fresa package.
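The idea of an absolute log-difference between term occurrences in the source and in the summary can be illustrated with a short Python sketch. This is not the Fresa package: the use of relative frequencies, the "+1" inside the logarithms and the function names are assumptions made for illustration, and the sketch returns a raw divergence (lower means closer) rather than a normalized Fresa score.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Consecutive n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def log_diff_divergence(source_tokens, summary_tokens, n=1):
    """Sum, over the source terms, of the absolute log-difference of frequencies."""
    src = Counter(ngrams(source_tokens, n))
    summ = Counter(ngrams(summary_tokens, n))
    src_total = sum(src.values())
    summ_total = sum(summ.values())
    if src_total == 0:
        return 0.0
    divergence = 0.0
    for term, c_src in src.items():
        p_src = c_src / src_total
        # Terms absent from the summary contribute with a zero frequency.
        p_sum = summ.get(term, 0) / summ_total if summ_total else 0.0
        # Assumption: add 1 inside the logarithm so that it is always defined.
        divergence += abs(math.log(p_src + 1) - math.log(p_sum + 1))
    return divergence
```

Calling the function with `n=1` and `n=2` gives unigram and bigram variants of the same comparison.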



5 Results 

In this section we present the results for each corpus with the different summarizers and
the several normalization strategies used. Based on these results, we have first verified
that ultra-stemming improves the performance of summarizers. Secondly, we show that Artex is
a system with performance similar, in terms of information content and processing times, to
other state-of-the-art summarizers.



5.1 Content evaluation 

• English corpus. Figure 3 shows the performance of the three summarizers using Fix1,
stemming and lemmatization. Results show that ultra-stemming improves the score of the
three automatic summarization systems. Artex and Cortex show similar performance
in information content.



Figure 3: Histogram plot of content evaluation for corpus DUC'04 Task 2 (English, 50
clusters, generic summarization), with ⟨Fresa⟩ measures, for each summarizer and each
normalization.



• Spanish corpus. Spanish is a language with greater variability than English. Results in
Figure 4 show that the Artex summarizer outperforms Cortex and Enertex if stemming
or lemmatization is used as normalization.

Figure 4: Histogram plot of content evaluation for Spanish corpus Medicina Clínica with
⟨Fresa⟩ scores for each summarizer.



• French corpus. French is a language with large variability too. Figure 5 shows the
⟨Fresa⟩ score on the French corpus Pistes. Results show a similar behavior:
ultra-stemming improves the score of the three automatic summarization systems used. In
particular, the efficacy of Artex is less sensitive to word normalization than that of the
other summarizers.

Figure 5: Histogram plot of content evaluation for French corpus Pistes with ⟨Fresa⟩ scores
for each summarizer.

5.2 Processing Times Evaluation 

Table 1 shows processing times averaged over all corpora, by normalization method, for the
Cortex, Artex and Enertex summarizers*. Processing times with Fix1 ultra-stemming are shorter
than with all other methods. For example, Cortex is a very fast summarizer with
$O(\log p^2)$ (where $p = P \times N$), and its processing times for stemming and Fix1 are
close. On the other hand, the Enertex summarizer has a complexity of $O(p^2)$, so it needs
more time to process the same corpus. The performance of the Artex algorithm remains close to
that of Cortex.





                   Summarizer average time (all corpora)
Normalization      Cortex      Artex      Enertex
Lemmatization      1.60'       2.50'      10.42'
Stemming           0.54'       1.29'       9.47'
Fix1               0.32'       0.40'       4.25'



Table 1: Statistics of processing times (in minutes) of three summarizers over three corpora. 



* All times were measured on a computer with 7.8 GB of RAM and a Core i7-2640M CPU @ 2.80GHz
× 4 processor, running 32-bit GNU/Linux (Ubuntu version 12.04).



6 Conclusions 



In this article we have introduced and tested a simple method for Automatic Text
Summarization. Artex is a fast and very simple algorithm based on the VSM and the extractive
paradigm. The method uses a matrix representation to calculate a normalized score for each
sentence, using the inner product of pseudo-(sentences|words) vectors. The algorithm retains
the salient information of each sentence of the document. An important aspect of our approach
is that it does not require linguistic knowledge or resources, which makes it a simple and
efficient summarization method for tackling the issue of Automatic Text Summarization.

Summaries generated by the Artex system are pertinent. The results obtained on corpora
in English, Spanish and French show that Artex can achieve good results for content quality.
Tests with other corpora (DUC and TAC evaluation campaigns, INEX, etc.) on mono- and
multi-document summarization guided by a subject, using content evaluation with (ROUGE
evaluations) or without reference summaries, are still in progress.